Promises and Challenges of populational Proteomics in Health and Disease

Advances in proteomic assay technologies have significantly increased coverage and throughput, enabling recent increases in the number of large-scale population-based proteomic studies of human plasma and serum. Improvements in multiplexed protein assays have facilitated the quantification of thousands of proteins over a large dynamic range, a key requirement for detecting the lowest-ranging, and potentially the most disease-relevant, blood-circulating proteins. In this perspective, we examine how populational proteomic datasets in conjunction with other concurrent omic measures can be leveraged to better understand the genomic and non-genomic correlates of the soluble proteome, constructing biomarker panels for disease prediction, among others. Mass spectrometry workflows are discussed as they are becoming increasingly competitive with affinity-based array platforms in terms of speed, cost, and proteome coverage due to advances in both instrumentation and workflows. Despite much success, there remain considerable challenges such as orthogonal validation and absolute quantification. We also highlight emergent challenges associated with study design, analytical considerations, and data integration as population-scale studies are run in batches and may involve longitudinal samples collated over many years. Lastly, we take a look at the future of what the nascent next-generation proteomic technologies might provide to the analysis of large sets of blood samples, as well as the difficulties in designing large-scale studies that will likely require participation from multiple and complex funding sources and where data sharing, study designs, and financing must be solved.


Promises and Challenges of populational Proteomics in Health and Disease
Benjamin B. Sun 1,* , Karsten Suhre 2,3 , and Bradford W. Gibson 4   Advances in proteomic assay technologies have significantly increased coverage and throughput, enabling recent increases in the number of large-scale populationbased proteomic studies of human plasma and serum.Improvements in multiplexed protein assays have facilitated the quantification of thousands of proteins over a large dynamic range, a key requirement for detecting the lowest-ranging, and potentially the most disease-relevant, blood-circulating proteins.In this perspective, we examine how populational proteomic datasets in conjunction with other concurrent omic measures can be leveraged to better understand the genomic and non-genomic correlates of the soluble proteome, constructing biomarker panels for disease prediction, among others.Mass spectrometry workflows are discussed as they are becoming increasingly competitive with affinity-based array platforms in terms of speed, cost, and proteome coverage due to advances in both instrumentation and workflows.Despite much success, there remain considerable challenges such as orthogonal validation and absolute quantification.We also highlight emergent challenges associated with study design, analytical considerations, and data integration as population-scale studies are run in batches and may involve longitudinal samples collated over many years.Lastly, we take a look at the future of what the nascent next-generation proteomic technologies might provide to the analysis of large sets of blood samples, as well as the difficulties in designing large-scale studies that will likely require participation from multiple and complex funding sources and where data sharing, study designs, and financing must be solved.
The last 2 decades have seen rapid development in omics technology culminating in a combination of high-throughput methods to measure the genome, transcriptome and proteome at unprecedented scale (1).Genomics and transcriptomics are maturing to a stage where the vast majority of the entire genome or transcriptome can be ascertained through next generation DNA sequencing approaches.Dogmatically, the effects of genomic and transcriptomic changes principally culminate at the protein level, which constitutes the functional unit of biological processes.While proteomic technologies have made considerable advances, they lag behind genomics and transcriptomics in terms of overall coverage, cost and speed (2).That said, as described in this perspective, proteomics technologies in recent years have been carried out at a scale unthinkable just a few years ago.
There are multiple reasons why industry and academic groups take on the immense costs and challenges of conducting large-scale population-based studies (see Fig. 1 and Box 1).Comprehensive data on the levels of blood proteins has many critical uses, perhaps none more important than identifying novel gene polymorphisms that can alter the concentrations of one or more blood proteins, otherwise referred to as cis and trans pQTLs (3).These gene-to-protein correlations can reveal novel biology, and in some cases become diagnostic biomarkers of disease or lead to the identification of novel therapeutic targets.In many cases, these data can help determine the mechanism of action of a gene and its encoded protein's role in a particular disease process that, when mapped onto known biochemical pathways or proteinprotein interaction networks, can also reveal novel up or downstream therapeutic targets and disease intervention strategies (4).
Unlike a specific tissue or cell type, proteins in the blood originate through multiple mechanisms, from 'classical' active secretion to shedding and cell leakage (5).It is not surprising that the major blood cells including erythrocytes, white blood cells, and platelets, along with the endothelial cells of the vascular system, are major contributors.In addition to these sources, the liver plays a major role in the active secretions of many blood proteins, although other organs and tissues throughout the body such as muscle also make significant contributions.Indeed, it has been postulated that there is a subset of unique proteins that can be tracked to every tissue or organ of the body and that this information can be used to generate disease-specific biomarkers (6).
While the human genome contains "only" 20,000 proteinencoding genes, the size of the proteome is much larger, with potential magnitude increase due to alternate forms through genetic variation and alternative splicing.In addition, there is a multitude of posttranslational modifications (PTMs), many of which have important regulatory functions, such as phosphorylation and glycosylation (7).Given that proteins may contain multiple PTMs in various combinations, there may be at least 10 million distinct proteoforms in the human proteome (8).Not all proteins are necessarily present in any one particular biological fluid or tissue at the same time, and the proteins that are present span a large dynamic range.This is especially true of blood proteins, which have a much larger dynamic range than the cellular proteome, with proteins ranging from high mg/ml abundances such as albumin and transferrin, to proteins involved in regulation and cell-cell signaling such as transcription factors, cytokines, and tumor-specific antigens that are often present at sub-pg/ml levels or observable only under certain pathological conditions (5).Complete capture of the entire proteome still falls short in coverage relative to genomics, but rapid progress in proteomic approaches has ushered in unprecedented insights from large-scale proteomics, especially in human blood plasma.
Although some large studies have used the serum, it is not surprising given the ease of collection, processing, and storage, blood plasma has been the most common fluid matrix employed in large proteomics studies in population cohorts/ biobanks.There have been multiple reports examining preanalytical processing steps on analytical reproducibility including temperature, collection tube types, timing of centrifugations to length of storage.While each step can lead to variations in protein measurements, for plasma it is not surprising that the timing of the first centrifugation is most critical, as this key step removes blood cells that might otherwise degrade over time and release proteins that contaminate the plasma proteome (9).In any case, whether plasma, serum, or other sample types, it is critical that the methods employed for collection and processing across the entire cohort be rigorously adhered to, especially as the largest sample repositories are likely to span many years and even decades and often involve multiple collection sites (2,10).

BROAD OVERVIEW OF PROTEOMIC APPROACHES
Current proteomics technologies can be categorized into two main approaches: mass spectrometry (MS) and affinitybased, both of which come in various forms.In affinitybased assays, protein-specific antibodies or aptamers are used to selectively bind to each target protein where the final readout is obtained using either fluorescence or DNA sequencing (11).This is typically carried out in sets of 96 or 384 well plate formats that can cover less than a hundred to several thousand unique proteins up to nine orders of magnitude in abundance.Most common platforms carry out various dilution steps to compensate for the large dynamic range in blood protein concentrations before a series of assay steps are carried out and final measurements are taken (see Fig. 2).Mass spectrometry methods on the other hand traditionally rely on the analysis of peptides generated by proteolytic digestion of the protein samples that are then separated by liquid chromatography and subjected to tandem mass spectrometry (LC-MS 2 ) (see Fig. 2).There are many variations to the MS-based workflows, which can include depletion of high abundance proteins using multiplexed antigen removal columns (12), HPLC pre-fractionation of the peptide samples to generate separate pools of less complexity (13), multiplexed chemical labeling such as TMTpro (14), gas phase separation such as ion mobility (15), or nanoparticle enrichment (16).Most published MS protocols share most steps and ultimately rely on various commercial and academic software suites to make the final spectral assignments.For the newer dataindependent acquisition (DIA) workflows that many believe to be best suited for large-scale plasma proteomics studies due to their superior acquisition speeds required for largescale studies, the DIA software platforms such as Spectronaut, DIA-NN, MaxDIA, and Skyline have significantly improved data processing speed with increased proteome coverage and are the most widely used in these types of studies (17).

EMERGENCE OF SOLUBLE PROTEOMICS IN POPULATIONAL COHORTS
Maturation and advances in these populational cohorts/ biobanks and proteomic technologies have ushered in significant increases in both sample sizes and proteins assayed, as exemplified by how proteogenomic studies have evolved over the last decade or so (Fig. 3).Cohorts with sample sizes beyond 10,000 measuring more than 1000 proteins are beginning to emerge more frequently since 2020 (4,18,19), with the vast majority of studies still predominantly focusing on Caucasian populations.In large scale (>5000) population sized cohorts, affinity-based methods, largely due to their

Promises and Challenges of Population Proteomics in Health
Mol Cell Proteomics (2024) 23( 7) 100786 3 high-throughput characteristics and cost considerations, are currently the predominant approach.Recent proteomics studies in biobank-scale cohorts such as deCODE (19) and UKB-PPP (4) have demonstrated feasibility of affinity-based proteomics at tens of thousands of samples measured over a span of months, paving the way for potential future scale-up to entire biobank sizes of >100,000 samples, currently prohibitive for MS methods.
As mentioned earlier, despite recent improvements in both throughput and sensitivity, MS-based proteomics still lags behind epitope-array-based technologies in the throughput needed for routine measurements beyond thousands of samples as well as the inherently more limited dynamic range needed to detect and quantify proteins at lower concentrations (<ng/ml).An effort at the scale of UKB-PPP (~50,000 samples) would likely have taken several years or a very high degree of platform multiplexing with standard MS set-ups available to us at the start of that effort in 2021 and would likely not have captured many low abundance blood proteins, some of which are well known to be directly linked to specific disease states.In part, this is attributable to the inherently more complex workflow and processing pipeline of intermediate data generated by MS methods, which can introduce technical variation as well as being time consuming, as opposed to the relatively simpler multiplexing and automatable set-up of affinity-based methods.Despite this, MS studies in thousands of individuals (including non-Europeans) with and without genomic information are beginning to emerge (20,21).
Unlike genomics data, the size of current affinity-based proteomics data generated in populational cohorts is relatively modest in comparison.Taking UKB-PPP as an example, in its simplest form, the data is essentially a matrix of ~50,000 individuals by ~3000 protein analytes, taking up approximately 3 GB of storage which can feasibly run locally.In comparison, the same type of bulk transcriptomic data (~18,000 gene transcripts) is ~6 times larger, and genomic data (without compression) measuring 12 million variants is ~4000 times larger, requiring higher order storage and processing capabilities in the form of cluster or distributed cloud computing.Even with the recent expansion of the Olink and SomaLogic assay coverage to between 5400 and 10,000 unique human proteins respectively, the size remains relatively modest.On the other hand, MS proteomics can generate potentially comparably large data to genomics given the more complex intermediate and peptide level data, which will also require significant aggregation and processing to get to much smaller protein level abundance read-outs per sample.For instance, the dataset generated using the Seer Proteograph platform in a recent study with 345 samples extended to 4 TB of spectral data, which is comparable in size to the sequencing data produced by whole genome sequencing of a similar-sized cohort (16).

COMBINING PROTEOMIC AND GENOMIC DATA FOR THE UNDERSTANDING OF DISEASE
Availability of large populational scaled biobanks and cohorts with genomics and proteomic information has ushered in studies mapping genetic associations with soluble protein abundances (protein quantitative trait loci, pQTLs) akin to the analogous studies in gene expression (eQTLs) (3)a comprehensive up-to-date list of pQTL studies is maintained at http://www.metabolomix.com/a-table-of-all-publishedgwas-with-proteomics/.These studies have progressively increased in sample size and proteins measured and are extending beyond blood plasma to other tissue matrices such as cerebrospinal fluid (Fig. 3, Table 1 provides a current snapshot focused on blood proteomics pQTL studies).The increasing availability of pQTL studies has greatly enhanced our understanding of genetic architecture regulating plasma protein abundances in terms of heritability, local (cis) and distant (trans) are enabling genetic-based inference of proteome in large cohorts genetic architectures with insights into pathways regulating protein levels and protein-protein interactions, causal insights into potential protein targets/pathways driving disease risk through approaches such as colocalization and Mendelian randomization with implications for drug targeting and safety indications.Recently, genetic scores have been constructed from large sample sizes for molecular omics traits including proteomics without such information to generate putative insights before embarking on often costly and time-consuming data generation (22).It is also important to note that genetic-based approaches provide static imputation of average protein (and other omic) levels covering only the heritable component.The proteome is dynamic and changes depending on alterations in context (e.g., environment, lifestyle, physiology, infection)-which are not captured in static genomic information.Ways of capturing the non-genetic variation of the proteome using epigenomics have been recently suggested (23)(24)(25).

APPLICATIONS OF PROTEOMICS FOR DISEASE PREDICTION
The emergence of large multiplex proteomic data in populational cohorts is beginning to usher in unprecedented exploration in proteomics-based predictive models beyond traditional risk factors based on a limited set of biomarkers or proteins.Using traditional penalized regression models, multiple studies have shown the utility of multiplex proteomics in predicting demographic factors such as age (4,26,27), body mass index (4, 28), renal function (4, 29), health metrics (29) as well as disease (30)(31)(32).
Proteomic prediction models also facilitate the development of risk scores akin to polygenic risk scores for a multitude of diseases (33).Whilst the score training and testing are fundamentally not dissimilar, there are some unique differences in developing proteomic versus genetic risk scoresproteomic data is magnitude narrower than genomic data and thus has less model under specification limitations where the number of predictors greatly exceeds sample size; certain proteome levels undergo dynamic changes as the disease progresses or at disease onset and may also be influenced by a range of non-causal factors as opposed to the germline genetic variation that are invariably unchanged throughout life.It can be envisaged that a combined omics score may offer additional risk prediction benefits than any score alone, especially in light of the relatively low-risk variance explained for many disease genetic risk scores.Lastly, in view of potential drops in genetic risk score performance across ancestries (34), the portability of proteomic risk models across populations and ancestry remains to be estimated and are important validation steps for generalizable models.

CHALLENGES AND LIMITATIONS
Despite significant advances in the field, there remains a range of challenges that need to be addressed both from technical and scientific viewpoints as we look towards the horizon for future outlook and opportunities-summarized in Box 2 and expanded below.

Limited Correlation Between Assay Technologies
Due to practical and organizational considerations, many populational cohorts have predominantly adopted one largescale proteomics approach, largely affinity-based.However, the same protein measurement may differ between proteomic assay approaches and this would have an impact in the transferability and generalizability of proteomic models across technology.A recent head-to-head study between SomaScan and Olink showed a range of poor to high Spearman's correlation for overlapping proteins with a median from studies ranging from 0.20-0.44(35)(36)(37) whilst in a study comparing three affinity-based platforms in a COVID-19 cohort showed median Kendell correlation ranging 0.36 to 0.70 between those platforms (38).For proteins with disparaging measures, it is difficult to establish one as more accurate than the other in the absence of gold standards.A wide range of factors may contribute to the lack of correlation such as reagents binding to different forms of the target, interactions with other proteins/complexes/materials in blood, sensitivities to protein abundances etc.This problem would be expected to be more pronounced for assays targeting the lowest abundance blood proteins, contributing to larger CVs -as observed in the recent UKB-PPP study (4) -and potentially a higher likelihood of erroneous measurements due to off target cross reactivities.From the selection of candidates measured with the traditional ELISA approach, the Olink approach seem to concord better with ELISAs than SomaScan (35,36), which may reflect Olink's use of antibodies, like ELISAs, as the capture reagents as opposed to aptamers.Noteworthy here is also a recent comparison of the Olink and the Nulisa platforms with a multiplexed cytokine assay, demonstrating good pairwise correlations of the three platforms (39).In addition to differences within affinity-based proteomic assays, there may also be systematic differences between affinity-based and MS approaches that may be due to significant differences in

Promises and Challenges of Population Proteomics in Health
Mol Cell Proteomics (2024) 23( 7) 100786 5     Promises and Challenges of Population Proteomics in Health 10 Mol Cell Proteomics (2024) 23 (7) 100786 technology and sample processing steps prior to measurement with the instruments (40).

Distinguishing Between Technical and Biological Variation
There are also important distinctions between technical and biological reproducibility as opposed to concordance across platforms across studies.Both assays have been shown in multiple studies to have good reproducibility in terms of relatively low coefficients of variation and biological associations of the same assay approach across studies (4,18,35,36).It is important to note that models and associations derived from one approach may generalize less well across assay platforms compared to within.

Orthogonal Validation
Most large-scale studies, including both mass spectrometry and epitope scanning technologies, often include a few examples where an orthogonal approach is used to provide independent validation of a handful of proteins.This is largely a 'trust but verify' approach to large-scale discovery experiments such as using ELISAs to validate specific high-plex proteomic findings (35,36).However, ELISAs are not well suited for many targets due to interferences and typically lack multiplexing capabilities for a set of custom targets.Multiple reaction monitoring (MRM) mass spectrometry is the preferred 'gold standard' in these cases where stable isotope-encoded peptides are also employed as internal standards (41).This type of independent validation is deemed essential if one is to construct a clinical biomarker assay.That said, MRM-MS assays can take time and resources to set up properly, especially if stable isotope peptides are employed and the response curves are properly determined over a broad dynamic range.Fortunately, constructing and validating large sets of quantitative peptide-based assays using MRM-MS is becoming faster and less expensive due to the availability of synthetic peptide reagents (42) and more sensitive triple quadrupole instruments (41).

Absolute versus Relative Quantification
Owing to various technical and practical constraints, largescale wide proteomic measures are quantified on relative rather than absolute scales.For investigations such as pQTL mapping and associations looking for statistically significant effects without necessitating absolute effect sizes, relative quantification has been sufficient.However, absolute quantification offers additional value in having a more interpretable effect size and allows for calibration and comparability across studies which is extremely difficult across studies without bridging samples or making assumptions about the underlying samples.Operating on relative quantifications may create additional challenges in calibrating measures in additionally measured samples cross-sectionally and longitudinally as well as harmonizing across datasets/populations. Absolute quantification is also important in standardized reference ranges in the clinical setting as values need to be comparable across runs and centres.

Multiple Testing Considerations
Following in the direction of genomics, there needs to be additional scrutiny in accounting for multiple testing, akin to the candidate gene versus genome-wide testing era, to minimize false positives in proteomic association findings.For example, assessing one protein from a proteomic array and finding an association at nominal p < 0.05 for one outcome in a biobank cohort without replication data, is by far too lenient and may lead to replication difficulties for many findings.However, the multiple correction threshold can be challenging to set, since one may argue to consider the number of independent proteins tested, or the entire human proteome of ~20,000, or whether to consider potential isoforms and posttranslational modifications which brings up the potential testing space significantly.Additional empirical studies similar to the genomics field will be needed to determine the optimal analogous "proteome-wide significance" with emphasis on independent replication recommended.Sharing of full summary statistics in an easily accessible manner is increasingly more important to maintain consistency and further downstream utilization of these results.

Expected Technological Advances
Given the continued improvement in coverage and throughput, coupled with a reduction in cost, it is inevitable that studies with larger sample sizes and increased proteome Box 2. Summary of challenges and future outlook of populational proteomics.7) 100786 11 coverage will emerge to further empower proteomic association discovery both from proteogenetic and non-genetic perspectives.Indeed, some of the newer proteomic technologies based on spatial imaging, nanopores or Edman-like sequential degradation could well represent a breakthrough should they succeed in both speed and high multiplexing needed to for blood-based applications to surpass or be competitive with current technologies (43).Some of these newer technologies are also capable of single molecule detection, i.e., proteins or peptides, that promise unparalleled sensitivity as well as absolute quantification (44,45).SomaLogic and Olink epitope array platforms are likely to continue to increase their protein coverage as they have over the last few years, although it is not clear at what point they will reach a plateau due to specificity and accuracy limitations as they target lower abundance blood proteins.In the MS realm, recent innovations in nextgeneration mass spectrometry platforms that have incorporated time-of-flight (TOF) analyzers in combination with orbitrap analyzers or trapped ion mobility spectrometry (TIMS) have significantly reduced the time needed for plasma analysis without sacrificing depth or coverage (46,47).A recent example of such an approach employed a timsTOF MS analyzer along with a prior separation step utilizing a novel nanoparticle peptide enrichment scheme to successfully map over a hundred novel pQTLs (16).Indeed, the Orbitrap Astral and timsTOF Ultra from Thermo Scientific and Bruker respectively, are now becoming competitive with array-based methods in terms of increased sensitivity, comprehensive detections, and speed (16) (https://www.genomeweb.com/proteomics-protein-research/mass-spec-retakes-center-stage-2023-next-gen-proteomics-startups).

Post-translational Modifications
The existence of multiple proteoforms for any given protein presents serious and largely unmet challenges.Protein ratios and other combinations extend the permutation space even further (48).Splice variants, single amino acid polymorphisms, and PTMs such as phosphorylation and glycosylation, together can greatly expand the ensemble of proteins in the blood, as well as lead to significant differences among individuals or disease states.For antibody and aptamer screening platforms, this can make accurate detection difficult due to the occlusion or alterations in binding efficiencies to their target epitopes.Mass spectrometry methods have the potential to more readily identify and differentiate among proteoforms but require high peptide coverage or additional PTM-based enrichment steps to be highly effective.Progress in both these methodologies may eventually circumvent these current limitations, or alternatively, some of the newer proteomic technologies that rely on single protein or peptide detection may be better suited for this challenge.New insights may also come from more specialized technologies that target PTM directly, such as hydrophilic interaction ultra-performance liquid chromatography (HILIC-UPLC) to fine-map the human blood plasma Nglycome (49).

Other Tissue Types at Scale
While large-scale proteomics studies have been almost exclusively in plasma or serum, there are clearly other bodily fluid sources such as CSF, saliva, and urine (50,51), or nonliquid sources such as tumors, organ biopsies, or formalinfixed paraffin-embedded (FFPE) clinical samples (52), that offer alternative and perhaps superior information for specific diseases.Additionally, pQTL mapping studies have been performed at these smaller scales in non-blood-based contexts (53)(54)(55)(56)(57)(58)(59)(60)(61)(62) with potential larger studies emerging (63).In most non-liquid sample types, MS-based proteomics has the current advantage in coverage and adaptability and will likely be the main technology driving these types of studies.Exploiting the large historical collection of FFPE samples seems especially attractive given their clinical relevance, and one would hope to see more of these types of sample sets being analyzed as newer high-throughput MS workflows become more widely utilized (64,65).

Proteomics in Diverse Ancestries
Akin to the genomics field, the majority of large-scale proteomics studies have focused on European ancestries.Given the genetic and environmental diversities that may affect protein abundances, studies in other ancestry groups and regions will be highly informative in elucidating these effects with some biobank efforts emerging (37).Studies in other ancestry groups will also shed light on environmental and behavioral impacts on protein levels as well as improve the generalizability of proteomics risk scores.

Longitudinal and Physiological Proteomic Changes
In addition to disease states affecting protein abundances either acutely and/or transiently or chronically, certain proteins may be highly dynamic longitudinally in normal physiological states.Large-scale studies to date have mostly focused on cross-sectional snapshots of protein levels with variable time before or after disease as a result of sampling rather than by design.Longitudinal measures so far have been understudied; however, dynamic protein changes can be challenging to capture as variations occur on the scale ranging from hours in circadian patterns to months throughout the year influenced by environmental and seasonal changes and even years at different stages in life.There is also additional complexity arising of physiological differences between sex at various stages in life such as puberty, pregnancy, and menopause.Therefore, the frequency at which longitudinal samples need to be ascertained remains a challenge and is currently limited by the availability of longitudinal samples.

Combining Evidence Across Studies
Sharing or merging multiple proteomic data sets would clearly benefit the scientific community and lead to more rapid progress in identifying novel pQTLs and disease biomarkers.There are many difficulties to overcome including normalization and comparability across independent studies, especially if different proteomic platforms are being used.One partial solution to this problem was the formation of the SCALLOP consortium (http://www.scallopconsortium.com/)where academic users, now encompassing over 28 research institutions, share plasma proteomic data analyzed using Olink's platform primarily for pQTL discovery (66).More recently, an industry-centered SCALLOP initiative has been proposed for analyzing longitudinal plasma or serum samples in the control or placebo arms of clinical trials using the Olink and potentially other proteomic platforms (personal communication, Anders Mälarstig).Should these and other such collaborations become more common, one would expect to see exponential growth in the volume of publicly available bloodbased proteomic

Data Sharing and Legal Considerations
Lastly, some of the most challenging aspects of designing and implementing large-scale proteomics studies are not necessarily technical, but rather have financial, organizational, contractual, data access, and intellectual property implications.In the UKB-PPP consortium, the alignment of multiple pharmaceutical companies necessitated legal departments with sometimes conflicting objectives to negotiate among the consortium members, and then with both the UKB as the gatekeeper of the large bank of plasma samples and Olink as the technology and data provider to reach agreements on sample selection and handling, timing, and the crucial period of restricted data access before it was to be made public.As we expect to see even larger or more complex studies undertaken in the coming years, the logistical challenges of sample procurement alone will likely require the participation of several biobanks and hospital resources to increase the study's power and ethnic diversity.Funding such studies also pose serious challenges, and one could imagine that joint government and industry participation, along with nonprofit organizations, would be needed to bring the proper resources for such large-scale multiple-year initiatives.Finding proper solutions to these many technical and other challenges will hopefully lead to a more widely accessible proteomic database that will benefit private, industry, and governmentfunded scientists alike.

FIG. 1 .Box 1 .
FIG. 1. Common uses of large-scale proteomics studies in blood.A, cis and trans-pQTLs and target discovery, (B) identifying possible mechanisms of action including altered activities and perturbations in protein-protein interactions, and (C) discovery of disease-specific biomarkers using machine learning, proteomic risk score analysis and MRM-MS peptide-based biomarker panel data to examine changes in relative abundances of one or more proteins between healthy and disease cohorts.

FIG. 2 .
FIG. 2. Workflows for large-scale coverage of the serum or plasma proteome.Mass spectrometry-based workflows that are currently capable of quantifying up to 2000 or more proteins is shown in the top panel.Highly multiplexed epitope-based array technology workflows such as Olink and SomaLogic are shown in the bottom.Both workflows require access to biobank (see far left, such as the UK Biobank) or clinical hospital banked serum or plasma specimens that require large-scale processing and storage facilities.

FIG. 3 .
FIG. 3. Sample size and proteomic coverage of human plasma proteogenomic studies between 2008 and 2023.Studies with sample sizes >2500 or >2000 proteins measured are labelled.Red indicates MS-based proteomics, blue indicates affinity-based proteomics.MANOLIS and Pomak are part of the HELIC (HELlenic Isolated Cohorts study) cohort.